This project aims to build a model that predicts the probability that a given claim will be approved immediately by the insurance company.
The evaluation metric is log loss.
See the README.md file and the competition's page for further details.
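As a quick reminder of what the metric measures, here is a minimal NumPy sketch of binary log loss (this is an illustration, not code used by the notebook; it should agree with `sklearn.metrics.log_loss` for binary targets):

```python
import numpy as np

def log_loss(y_true, y_prob, eps=1e-15):
    """Binary log loss: the negative mean log-likelihood of the
    predicted probabilities. Lower is better; 0 is a perfect score."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities so log(0) never occurs
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob)
                    + (1 - y_true) * np.log(1 - y_prob))

# Confident, correct predictions give a small loss
print(log_loss([1, 0], [0.9, 0.1]))  # ≈ 0.105
```

Note that the metric penalizes confident wrong predictions very heavily, which is why clipping the probabilities away from 0 and 1 matters in practice.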
# Loading useful Python packages for Data cleaning and Pre-processing
import numpy as np
import pandas as pd
import pandas_profiling
import matplotlib.pyplot as plt
import category_encoders as ce
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import StandardScaler
import seaborn as sns
import warnings
warnings.simplefilter(action='ignore')
pd.set_option('display.max_columns', 150)
# loading datasets
train_df = pd.read_csv('data/dataset_treino.csv')
test_df = pd.read_csv('data/dataset_teste.csv')
train_df.head()
In the following lines, we'll make several modifications to the datasets. To evaluate the impact of each modification, we'll save each version of the datasets as an entry in a dictionary.
data = {}
data['original'] = {'train': train_df, 'test': test_df}
The first step before performing any kind of statistical analysis or modeling is to clean the data.
Let's see the type of data we have.
train_df.info()
From the above, we can see that this data set has 114321 rows and 133 columns.
Also, we have 114 numerical features (columns) and 19 categorical features.
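The numeric/categorical split reported by `info()` can also be computed directly with `select_dtypes`; a small sketch on a toy frame (the toy columns are illustrative):

```python
import pandas as pd

# Toy frame: two numeric columns, one categorical (object) column
toy = pd.DataFrame({'v1': [1.0, 2.0], 'v2': [3, 4], 'v3': ['A', 'B']})

n_numeric = toy.select_dtypes(include=['number']).shape[1]
n_categorical = toy.select_dtypes(include=['object']).shape[1]
print(n_numeric, n_categorical)  # 2 1
```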
Let's see if we have null values (also known as NaN).
# Are there any null values?
train_df.isnull().values.any()
# Null values amount for each column
train_df.isnull().sum().sort_values(ascending=False)
So, we have a lot of null values in several columns.
Let's check the percentage of null values for each column.
null_values = train_df.isnull().sum()
null_values = round((null_values/train_df.shape[0] * 100), 2)
null_values.sort_values(ascending=False)
Considering that we are dealing with anonymized data and can't know the meaning of each feature, I'll remove all columns with more than 40% null values.
# Get the names of the columns that have more than 40% of null values
high_nan_rate_columns = null_values[null_values > 40].index
# Make a copy of the original datasets and remove the columns
train_df_cleaned = train_df.copy()
test_df_cleaned = test_df.copy()
train_df_cleaned.drop(high_nan_rate_columns, axis=1, inplace=True)
test_df_cleaned.drop(high_nan_rate_columns, axis=1, inplace=True)
# Remove the ID column (it is not useful for modeling)
train_df_cleaned.drop(['ID'], axis=1, inplace=True)
train_df_cleaned.info()
Now we have only 30 columns in the data set.
But we still have null values that need to be handled.
null_values_columns = train_df_cleaned.isnull().sum().sort_values(ascending=False)
null_values_columns = null_values_columns[null_values_columns > 0]
null_values_columns
train_df_cleaned[null_values_columns.index].info()
From the above, there are 8 numeric columns and 9 categorical columns with null values.
For now, we'll replace the null values with the MEAN of each numeric column and the MODE of each categorical column.
###### TRAIN DATASET ######
##### Numerical columns
null_values_columns_train = train_df_cleaned.isnull().sum().sort_values(ascending=False)
numerical_col_null_values = train_df_cleaned[null_values_columns_train.index].select_dtypes(include=['float64', 'int64']).columns
# for each numerical column with nulls
for c in numerical_col_null_values:
    # Get the mean
    mean = train_df_cleaned[c].mean()
    # replace the NaN by the mean
    train_df_cleaned[c].fillna(mean, inplace=True)
##### Categorical columns
categ_col_null_values = train_df_cleaned[null_values_columns_train.index].select_dtypes(include=['object']).columns
# for each categorical column with nulls
for c in categ_col_null_values:
    # Get the most frequent value (mode)
    mode = train_df_cleaned[c].value_counts().index[0]
    # replace the NaN by the mode
    train_df_cleaned[c].fillna(mode, inplace=True)
###### TEST DATASET ######
##### Numerical columns
null_values_columns_test = test_df_cleaned.isnull().sum().sort_values(ascending=False)
#print(null_values_columns_test)
numerical_col_null_values = list(test_df_cleaned[null_values_columns_test.index].select_dtypes(include=['float64', 'int64']).columns)
numerical_col_null_values.remove('ID')
# for each numerical column with nulls
for c in numerical_col_null_values:
    # Get the mean
    mean = test_df_cleaned[c].mean()
    # replace the NaN by the mean
    test_df_cleaned[c].fillna(mean, inplace=True)
##### Categorical columns
categ_col_null_values = test_df_cleaned[null_values_columns_test.index].select_dtypes(include=['object']).columns
# for each categorical column with nulls
for c in categ_col_null_values:
    # Get the most frequent value (mode)
    mode = test_df_cleaned[c].value_counts().index[0]
    # replace the NaN by the mode
    test_df_cleaned[c].fillna(mode, inplace=True)
# Are there any null values left?
print(train_df_cleaned.isnull().values.any())
print(test_df_cleaned.isnull().values.any())
# Save the list of current columns
selected_columns = list(train_df_cleaned.columns)
selected_columns_test = selected_columns[:]
selected_columns_test.remove('target')
selected_columns_test.append('ID')
# Filter the columns in the test dataset
test_df_cleaned = test_df_cleaned[list(selected_columns_test)]
# Save the datasets in dict
data['cleaned_v1'] = {'train': train_df_cleaned.copy(), 'test':test_df_cleaned.copy()}
Now that the dataset is cleaned, let's compute some statistics about the data and apply the transformations.
We'll use the Pandas Profiling library to create a report about the data.
%%time
train_df_cleaned.profile_report(style={'full_width':True})